suppressWarnings({
  if (! requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  if (! requireNamespace("Biobase", quietly = TRUE)) {
    BiocManager::install("Biobase")
  }
  if (! requireNamespace("limma", quietly = TRUE)) {
    BiocManager::install("limma")
  }
  if (! requireNamespace("knitr", quietly = TRUE)) {
    BiocManager::install("knitr")
  }
  if (! requireNamespace("ggrepel", quietly = TRUE)) {
    BiocManager::install("ggrepel")
  }
  if (! requireNamespace("gprofiler2", quietly = TRUE)) {
    BiocManager::install("gprofiler2")
  }
})

Introduction

In Assignment #1, I analyzed RNA-seq data from the study Alveolar macrophages from persons living with HIV show impaired epigenetic response to Mycobacterium tuberculosis (Correa-Macedo et al., 2021). Individuals with HIV are at higher risk to have tuberculosis (TB), and this is often due to infection with Mycobacterium tuberculosis (M. tuberculosis) that rapidly progresses to disease (Correa-Macedo et al., 2021). Alveolar macrophages (AMs) are the first cells in the immune system that interact with M. tuberculosis, however their interaction with HIV and antiretroviral therapy (ART) is still unknown (Correa-Macedo et al., 2021). Thus, this study aimed to investigate the transcriptomic and epigenetic response of AMs to M. tuberculosis and how HIV and ART play a part in these mechanisms.

AMs were obtained from 16 control subjects who were HIV-free (HC), 20 persons living with HIV receiving ART (PLWH), and 14 subjects who received ART as preexposure prophylaxis (PrEP) to prevent HIV infection. Each sample was challenged with M. tuberculosis in vitro.

Some basic statistics:

GEO ID: GSE165708

The original dataset had over 60,000 genes. After using edgeR filtering protocols, 13,780 genes were left. 8 genes were duplicates so those were removed. The data was then normalized using the Trimed Mean of M-values (TMM) method. After normalizing, a Multi-Dimensional Scaling (MDS) plot was used to visualize clustering, but no clear clustering by sample group was seen. There is a pattern that the Healthy Control (HC) group (in pink) tends to be on the left side, the Persons living with HIV (PLWH) group (in green) tending to be more on the right, and the pre-exposure prophylaxis (PrEP) group (in blue) towards the bottom middle. However, compared to other clearly-clustered examples, this is definitely a very obscure distinction between sample groups. I also tried plotting by the other conditions, by patient or by challenged status, but the clustering was even more obscure.

library(ggplot2)
mds <- readRDS(file=file.path(getwd(), "data", "mds.rds"))
qplot(X1, X2, color = sample_group, data = mds, main = "Figure 1: MDS plot of Poisson Distances by sample group")

Thus, this notebook aims to take this normalized expression data and: - rank genes according to differential expression; - perform thresholded over-representation analysis (ORA); - and highlight dominant themes in the top set of genes.

Differential Gene Expression

From the first assignment, we know that the possible groups were sample group, patient, and whether or not the sample was challenged. Based on the MDS plot of the normalized data, it seems like there is a pattern that the Healthy Control (HC) group (in pink) tends to be on the left side, the Persons living with HIV (PLWH) group (in green) tending to be more on the right, and the pre-exposure prophylaxis (PrEP) group (in blue) towards the bottom middle. Although this wasn’t a very clear distinction of clusters, but plotting by other conditions, by patient or by challenged status made the clustering even more obscure. Thus, I’m choosing to design my model by sample group (HC, PLWH, and PrEP).

library(Biobase)
library(limma)
library(knitr)
# create design matrix
groups <- readRDS(file=file.path(getwd(), "data", "groups.rds"))
cond_mod <- model.matrix(~ groups$sample_group)

# create data matrix
normed_log_counts <- readRDS(file=file.path(getwd(), "data","normed_log_counts.rds"))
rownames(normed_log_counts) <- make.unique(rownames(normed_log_counts)) # TO-DO: why are there duplicated rownames...? how is that possible?
es <- ExpressionSet(assayData = normed_log_counts)

# test for differential expression
fit <- lmFit(es, design = cond_mod)
fit2 <- eBayes(fit, trend = TRUE) # calculates p-values too

# sort by p-value
pval_thres <- 0.05 
fit_pvals <- data.frame(fit2$p.value)
colnames(fit_pvals) <- c("pvalue", "sample_groupPLWH", "sample_groupPrEP")
fit_pvals <- fit_pvals[order(fit_pvals$pvalue), ]
fit_pvals$pvalue <- as.character(fit_pvals$pvalue)
kable(fit_pvals[1:10, ], row.names = TRUE, format = "markdown",
              caption = "Table 1: Top 10 genes with lowest p-values")
Table 1: Top 10 genes with lowest p-values
pvalue sample_groupPLWH sample_groupPrEP
RHOA 2.82526088386643e-161 0.8062366 0.8007028
TMBIM6 2.39015331413043e-156 0.0369018 0.9921279
SRPRA 3.68140366727055e-155 0.0123199 0.8398919
PGK1 1.98918865990842e-153 0.0303051 0.6413137
EIF4H 1.84135285166656e-152 0.2096304 0.4373866
ARPC2 3.16706223053355e-152 0.0069919 0.0007316
PKM 7.13556529863017e-152 0.0257550 0.1241887
UBA1 2.35134329445641e-150 0.9339848 0.2785843
PCBP1 4.44693529627169e-150 0.0335999 0.3530383
YWHAB 2.8893887222484e-149 0.2278528 0.0239897

We can see from table 1 that just the top 10 genes with smallest p-values have nearly 0 as p-values, which means that for these genes at least, the null hypothesis is rejected (there is strong differential expression).

There are exactly 646 genes that have a p-value of less than 0.05, which is 4.6957912%.

However, it’s important to note these statistics are before correcting p-values for multiple testing. We’ll adjust for this using the Benjamini-Hochberg Procedure, which controls for the fact that some small p-values may occur by chance (Glen, n.d.).

# correct p-values with mult hyp correction, sorted by adjusted p-value
mhc <- topTable(fit2, coef = ncol(cond_mod), adjust.method = "BH", 
                number = nrow(es), sort.by = "p")
knitr::kable(mhc[1:10, c("P.Value", "adj.P.Val")], row.names = TRUE, format = "markdown", digits = 32, caption = "Table 2: Top 10 genes with lowest adjusted p-values")
Table 2: Top 10 genes with lowest adjusted p-values
P.Value adj.P.Val
AL162151.2 4.375757e-10 5.989604e-06
RAB5C 8.707719e-10 5.989604e-06
DVL2 1.312518e-09 6.018771e-06
CCAR1 5.439819e-09 1.870890e-05
MBD4 9.506086e-09 2.615505e-05
CBX3 1.564460e-08 3.587045e-05
FAHD1 2.719752e-08 5.345089e-05
KPNB1 4.709188e-08 8.098038e-05
HSPA14 7.069825e-08 1.080662e-04
RPL41P2 9.028291e-08 1.242022e-04

Based on table 2, we see that the 10 genes with the lowest p-values are completely different. Now, there are 894 genes that have a p-value of less than 0.05, which is 6.4985098%.

# referred to https://biocorecrg.github.io/CRG_RIntroduction/volcano-plots.html
# volcano plot, highlight genes of interest
library(ggrepel)
options(ggrepel.max.overlaps = Inf)
log_thres <- 1.5

mhc$diffexpressed <- "none"
mhc$diffexpressed[mhc$logFC > log_thres & mhc$P.Value < pval_thres] <- "up" # up-regulated
mhc$diffexpressed[mhc$logFC < -log_thres & mhc$P.Value < pval_thres] <- "down" # down-regulated

mhc$delabel <- NA
mhc$gene_name <- rownames(mhc)
mhc$delabel[mhc$diffexpressed != "none"] <- mhc$gene_name[mhc$diffexpressed != "none"]

volc <- ggplot(data = mhc, aes(x=logFC, y=-log10(P.Value),
               col = diffexpressed, label = delabel)) +
  geom_point() +
  theme_minimal() +
  geom_text_repel(force = 0.8) +
  labs(title = "Figure 2: Volcano Plot highlighting Differentially-Expressed Genes",
       x = "log fold change", y = "-log10(p-value)", 
       col = "Differential Expression") +
  scale_color_manual(values = c("blue", "black", "red"), labels = c("down-regulated", "none", "up-regulated")) +
  geom_hline(yintercept = -log10(pval_thres), col = "red") +
  geom_vline(xintercept = c(-log_thres, log_thres), col = "red") 
volc

Out of 13757 genes total, 11 genes were up-regulated, and 17 genes were down-regulated.

kable(mhc[mhc$diffexpressed == "up", c("logFC", "P.Value", "adj.P.Val")],
      caption = "Table 3: Up-regulated genes",
      col.names = c("log fold change", "p-value", "adjusted p-value"))
Table 3: Up-regulated genes
log fold change p-value adjusted p-value
AC099329.1 3.465474 0.0001005 0.0085155
DDX3Y 2.601536 0.0001852 0.0114606
HLA-J 1.782034 0.0003757 0.0163025
KDM5D 2.490680 0.0003956 0.0168994
RPL13P12 2.119959 0.0004586 0.0184508
RPS4Y1 1.989445 0.0012256 0.0302693
LINC01168 1.866820 0.0016165 0.0344900
TXLNGY 1.959320 0.0016625 0.0346872
USP9Y 1.850793 0.0037492 0.0542427
POLRMTP1 1.524621 0.0042310 0.0582409
CES1P1 1.677667 0.0086981 0.0835028
kable(mhc[mhc$diffexpressed == "down", c("logFC", "P.Value", "adj.P.Val")],
      caption = "Table 4: Down-regulated genes",
      col.names = c("log fold change", "p-value", "adjusted p-value"))
Table 4: Down-regulated genes
log fold change p-value adjusted p-value
EMB -1.715152 0.0000212 0.0038315
MMP12 -1.787043 0.0001557 0.0103480
IRF4 -1.514623 0.0004954 0.0193599
UBD -1.958963 0.0013558 0.0317754
C2orf27A -1.517468 0.0014261 0.0324276
CCL17 -1.608707 0.0016722 0.0347043
IDO1 -2.189067 0.0018622 0.0364743
TRBC2 -1.816124 0.0028971 0.0469992
HLA-DRB6 -1.801385 0.0043702 0.0590011
TRAC -1.621051 0.0052988 0.0642004
SPOCK2 -1.584854 0.0084706 0.0821215
IL2RB -1.700648 0.0108988 0.0947113
HLA-W -1.948183 0.0109724 0.0950626
IKZF3 -1.568477 0.0132272 0.1030976
CXCL9 -1.791222 0.0178988 0.1230552
CD2 -1.581964 0.0252116 0.1480786
ACOD1 -1.513909 0.0324162 0.1703998
# visualize top hits with heatmap
library(ComplexHeatmap)
hm_matrix <- normed_log_counts[which(mhc$diffexpressed == "up" | mhc$diffexpressed == "down"),]
hm_matrix[which(!is.finite(hm_matrix))] <- 0 # remove infinite value
hm_matrix <- t(scale(t(hm_matrix))) # scale to normalize

# referred to: https://jokergoo.github.io/ComplexHeatmap-reference/book/heatmap-annotations.html#block-annotation
Heatmap(hm_matrix, column_order = order(sapply(strsplit(colnames(hm_matrix),"\\."), `[`, 4)),
        name = "normalized log counts",
        column_title = "Figure 3: Top Genes by Differential Expression",
        row_names_gp = gpar(fontsize = 6),
        column_names_gp = gpar(fontsize = 5),
        column_split = groups$sample_group,
        top_annotation = HeatmapAnnotation(sg = anno_block(gp = gpar(fill = 0),
          labels = c("HC", "PLWH", "PrEP"), 
          labels_gp = gpar(col = "black", fontsize = 8))))

This heatmap is ordered column-wise by patient, with every two columns representing one patient (every other column is the challenged sample). We can see that there is very little clustering by patient. Each block represents one sample group (HC, PLWH, or PrEP). The left half of each block are non-challenged samples (sample name ends in a 0), while the right half are challenged samples (sample name ends in a 1).

1. Calculate p-values for each of the genes in your expression set. How many genes were significantly differentially expressed? What thresholds did you use and why?

There are exactly 646 genes that have a p-value of less than 0.05, which is 4.6957912%. I chose to stay with the conventional p-value threshold of 0.05 based on arguments by Di Leo & Sardanelli (2020). Achieving enough power with a lower threshold for p-value would require larger sample sizes, which can be difficult especially with this line of research and acquiring the type of data that this study uses (Di Leo & Sardanelli, 2020). I used a 1.5 fold change threshold to determine whether genes were up or down regulated, based on the opinions of a couple forum posts (Sulaiman, 2018).

2. Multiple hypothesis testing - correct your p-values using a multiple hypothesis correction method. Which method did you use? And Why? How many genes passed correction?

I chose to use the default Benjamini-Hochberg (BH) method for correction as it is recommended as one of the most superior in terms of user-friendliness and documentation, and since it is the default in limma too so using this method could increase probability of potential comparison with other analyses (Korthauer et al., 2019; Ritchie et al., 2015). IHW was also ranked as a top method, but I chose against this since it demonstrated low False Discovery Rate (FDR) compared to other methods (Korthauer et al., 2019). However, IHW did have very similar patterns to BH in terms of FDR control, applicability, and usability (Korthauer et al., 2019). 894 genes passed correction.

3. Show the amount of differentially expressed genes using an MA Plot or a Volcano plot. Highlight genes of interest.

See above Volcano plot.

4. Visualize your top hits using a heatmap. Do you conditions cluster together? Explain why or why not.

We see that there are a group of genes in the top half of the map that are more up-regulated in healthy control that are more down-regulated in the PLWH (Persons living with HIV) sample group. In the PrEP (preexposure prophylaxis) group, we do not see as much of this pattern. When I tried to order the columns by the other conditions (like by patient or challenged status), there was no distinguishable clustering. This clustering by sample group is expected based on the MDS plot from earlier.

Thresholded Over-Representation Analysis

library(gprofiler2)
# write thresholded lists of genes
upreg_genes <- mhc$gene_name[mhc$diffexpressed == "up"]
write.table(x = upreg_genes,
            file = file.path("data", "upregulated_genes.txt"))

downreg_genes <- mhc$gene_name[mhc$diffexpressed == "down"]
write.table(x = downreg_genes,
            file = file.path("data", "downregulated_genes.txt"))
# calculate rank
mhc$rank <- -log(mhc$P.Value, base = 10) * sign(mhc$logFC)
ranked_genes <- mhc[, c("gene_name", "rank")]
write.table(x = ranked_genes,
            file = file.path("data", "ranked_genes.txt"))

all_genes <- mhc$gene_name[mhc$diffexpressed != "none"]
write.table(all_genes, file = file.path("data", "all_genes.txt"))

gprof <- gost(query = all_genes, organism = "hsapiens",
              sources = c("GO:BP", "KEGG", "HPA", "HP", "REAC")) # use all genes from array as background
gprof_plot <- gostplot(gprof, capped = FALSE, interactive = T)
gprof_plot
gprof_up <- gost(query = upreg_genes, organism = "hsapiens",
                 sources = c("GO:BP", "KEGG", "HPA", "HP", "REAC")) 
gprof_up_plot <- gostplot(gprof_up, capped = FALSE, interactive = T)
gprof_up_plot
gprof_down <- gost(query = downreg_genes, organism = "hsapiens",
                   sources = c("GO:BP", "KEGG", "HPA", "HP", "REAC")) 
gprof_down_plot <- gostplot(gprof_down, capped = FALSE, interactive = T)
gprof_down_plot

1. Which method did you choose and why? I chose to use Over-Representation Analysis (ORA) as the gene set enrichment analysis method since it is one of the most widely used and simple (Maleki et al., 2020).

2. What annotation data did you use and why? What version of the annotation are you using?

I used most data sources available for the homo sapiens organism since the data is from human cells. The authors used GO-term enrichment analysis too, so including GO would help me compare my results with theirs. I excluded the datasets that were not relevant to the study, such as GO:MF (molecular function), GO:CC (cellular component), etc. I also included the KEGG and REAC databases since the original study also performed gene ontology enrichment analysis with that. The version of data is the most updated based on g:Profiler, with the gprofiler2 package version being 0.2.1.

3. How many genesets were returned with what thresholds?

17 genesets were returned with the default threshold 0.05, which is the same p-value significance threshold I used throughout this notebook.

4. Run the analysis using the up-regulated set of genes, and the down-regulated set of genes separately. How do these results compare to using the whole list (i.e all differentially expressed genes together vs. the up-regulated and down regulated differentially expressed genes separately)?

In the analysis of just the up-regulated set of genes, 3 genesets were returned. All of these genesets were from the Human Phenotype Ontology (HP) datasource. For the down-regulated set, there were 27 genesets returned. One geneset was from the biological process source (GO:BP) and two were from the Kyoto Encyclopedia of Genes and Genomes (KEGG).

Interpretation

1. Do the over-representation results support conclusions or mechanism discussed in the original paper?

In the original paper, they noted 40 genes that were differentially expressed at FDR less than or equal to 5% in response to M. tuberculosis (TB) in the PLWH and PrEP groups (Correa-Macedo et al., 2021). They may have more differentially-expressed genes by using different logFC threshold or methods of normalization. Interestingly, they used the same FDR method (Benjamini-Hochberg), but they applied it to raw P values while I applied it to the values from the fitted model which could have made a difference.

Following in vitro challenge with TB, AMs from each group displayed overlapping but distinct profiles of significantly up- and downregulated genes in response to TB. They reported that AMs from the PLWH and PrEP subjects had substantially weaker transcriptional response. We can see similar results from the heatmap that genes are generally more downregulated/have weaker transcriptional response in both the PLWH and PrEP groups, while being more upregulated in the healthy controls.

They also saw that the magnitude of AM transcriptional response to TB differed across the 3 sample groups, with the smallest mean absolute log fold change in response to MTB among PrEP subjects, and the strongest transcriptional response in HCs. We do not see this difference very clearly, as based on the heatmap the challenged samples in the PrEP group do not seem to have any major difference in log fold change.

PLWH had stronger transcriptional response than PrEP subjects. They also saw that logFC values were consistently higher for corresponding genes from HC subjects, suggesting a transcriptional impairment in PLWH and PrEP subjects. We do see this result in our heat map as the HC group visually looks more upregulated than the other groups, meaning they had higher logFC values.

They also reported number of significant GO terms/pathways to be substantially lower for PLWH and PrEP groups. Upon further analysis, it would be interesting to compare the pathway results across sample groups instead of just up- and downregulated genes.

The top term returned from GO:BP is “response to cytokine,” from KEGG it’s “Viral protein interaction with cytokine and cytokine receptor,” and from HP it’s “Y-linked inheritance.”

For up-regulated genes, the top term is from HP: “Y-linked inheritance.”

For down-regulated genes, the top term is from GO:BP: “response to cytokine.”

The original paper did do a cytokine analysis and found that AMs from the PLWH and PrEP groups had reduced cytokine secretions levels compared to the HC groups. This suggests there is a differential role of cytokines in HIV and PrEP.

2. Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your results. HIV infection targets the immune system, leading to immunodeficiency (Le Saout, Lane, & Catalfamo, 2012). Since the top terms from two of the datasets were related to cytokines, this points to an importance of cytokines with HIV or TB infection. Cytokines have an important role in the immune system, and HIV infection actually leads to dysregulation of cytokine profile (Kedzierska & Crowe, 2001). During the course of HIV-1 infection, secretion of certain T-helper type 1 cytokines like is decreased while T-helper type 2 cytokines is increased (Kedzierska & Crowe, 2001). In fact, one of the T-helper type 1 cytokines is antiviral interferon gamma, which is one of our top hits from GO:BP.

Most of the other top hits are also just related to immune response or cytokine production regulation, which follows from how infection occurs.

The topic of the top term from HP, “Y-linked inheritance,” like sex-links or sex chromosomes, was interestingly not really talked about in the original paper. However, the Y chromosome is said to have a prominent role in determining outcomes of HIV infection (Maan et al., 2017).